Investigating CO2 Emissions and Renewable Energy Usage

Author

Noel Lopez, Patrick George, Riley Svensson, Ningjing Hua

Code
library(plotly)
library(tidyverse)
library(broom)
library(knitr)
library(DT)
library(kableExtra)
library(gridExtra)
library(patchwork)
options(scipen = 99999)
energy <- readxl::read_xlsx(here::here("renewable_energy.xlsx"))
co2 <- readxl::read_xlsx(here::here("co2.xlsx"))
Code
convert_to_numeric <- function(x) {
  x <- str_replace_all(x, "k", "e3")
  x <- str_replace_all(x, "M", "e6")
  return(as.numeric(x))
}


co2_clean <- co2 |>
  select(country, `1989`:`2017`) |>
  mutate(across(.cols = `1989`:`2017`, 
                ~ convert_to_numeric(.))) |>
  pivot_longer(cols = !country,
               names_to = "year",
               values_to = "Total_CO2_Emissions")

energy_clean <- energy |>
  pivot_longer(cols = !country,
               names_to = "year",
               values_to = "Percent_Consumption_Renewable")

joined <- inner_join(co2_clean, energy_clean) |>
  drop_na()

joined |>
  distinct(year) |>
  count()

joined |>
  distinct(country) |>
  count()

master <- joined|>
  rename(Year = year) |>
  group_by(Year) |>
  mutate(Year = as.numeric(Year)) |>
  summarize(`Average Emissions` = mean(`Total_CO2_Emissions`),
            `Average Percent Renewable`
            = mean(`Percent_Consumption_Renewable`))

1 Introduction

Climate change, fueled by emissions from energy consumption, is a global challenge jeopardizing humanity’s future. This purpose of this project is to investigate the relationship between CO2 emissions, one of the largest drivers of climate change through producing greenhouse gasses, and the use of renewable energy such as solar, wind, or hydroelectric energy. The master data to be used in this project incorporates data from two different data sets, one measuring total carbon dioxide emissions per country over the years and one measuring proportion of energy use that is renewable over the years. Total_CO2_Emission is in 1000 tons of CO2 and Percent_Consumption_Renewables is a percentage of total energy use which can be classified as renewable. Because the data is collected on individual countries over the years, this investigation will primarily focus on the Average Emissions , every country’s Total_CO2_Emission averaged for every year, as well as the Average Percent Renewable of all the countries in a specific year. The master incorporates the average across 207 distinct countries observed over 29 years. The relationship between Average Emissionsand Average Percent Renewable is hypothesized to be negative, whereas if humanity uses more renewable energy on average there would be less average CO2 emissions released into the environment. Through this investigation more information about climate change and potential solutions to this global problem affecting every organism on planet Earth will be revealed.

2 Materials and Methods

The master data to be used incorporates the average CO2 emissions and average percent renewable energy use for every country studied over every year in the data. The data was sourced through Gapminder, an online database which has been a reliable provider of data since 2005 in hopes of promoting global sustainability through easily accessible information.

Before analysis, the data had to be properly cleaned. The CO2 emissions data contained values of incorrect form such as 25.3k and 4.9M, signaling the units of thousands and millions. The first step in the data cleaning process was to replace these characters with numbers to create a numeric variable type with which calculations could be performed. The data was then pivoted to tidy form and only the years of interest (1989 – 2017) were selected. The two data sets were joined by year and country to produce a master data set which shall be used for model fitting, plotting, and analysis. A final decision was made to drop all the Na’s in the master set. The missing values typically occurred in the Average Percent Renewable variable across the earlier years and usually occurren in countries with smaller populations. Whether these Na’s were included in the original data due to lack of observation or for simply not having any renewable energy production or consumptionis unclear. Because the averages are being studied and the Na’s usually occured in smaller countries, there was still a large enough sample size to average over once those missing observations were dropped. Other forms of imputation were considered such as cell-mean imputation but not ultimately not conducted due to fears of introducing bias to the data.

Linear regression, a method involving predicting the values of one variable, based on another, through producing a straight line minimizing the value for the sum of squared residuals, was used to create a model predicting the relationship. All the data cleaning, models, and subsequent analysis was conducted using R code.


Code
datatable((head(master, n = 50)),
          caption = 'Interactive Preview of Data Set')


3 Analysis and Discussion of Model

Code
energy_emissions_model <- lm(`Average Emissions` ~ 
                             `Average Percent Renewable`,
                             data = master)
summary(energy_emissions_model)
tidy(energy_emissions_model)

Regression Equation:

\[ \hat{y} = 693309.6 - 17770.2 * Average\,Percent\,Renewable \]

Above is the simple linear regression equation based on the model predicting the response variable (\(\hat{y}\)), Average Emissions , by the explanatory variable, Average Percent Renewable. The coefficient on Average Percent Renewable is extremely negative sitting at -17770.2 meaning that for each one percent increase in the Average Percent Renewable energy used the average amount of CO2 emissions in thousands of tons decreases by 17,770.2. The slope coefficient on the y-intercept of 693,309.6 means that when the Average Percent Renewable is zero such that there is no renewable energy being consumed at all on average, the predicted Average Emissions would be 693,309.6 thousand tons of CO2.


Code
# Plot 1
raw_graph <- master |> 
  ggplot(aes(x = `Average Percent Renewable`, 
            y = `Average Emissions`)
       ) +
  geom_point(color = 'aquamarine4', size = 0.9) + 
  geom_smooth(method = "lm", 
              color = 'lightgoldenrod4', 
              size = 0.5,
              se = FALSE) +
  theme(legend.position = "none") +
  labs(x = " Average Renewable Energy Consumption (%)", 
       y = "", 
       title = "Relationship between Renewable Energy Usage and CO2 Emissions", 
       subtitle = 'Average Emissions (1000 tons)') 


ggplotly(raw_graph) |>
  layout(title = list(text = paste0('Relationship between Renewable Energy Usage and CO2 Emissions',
                                    '<br>',
                                    '<sup>',
                                     'Average Emissions (1000 tons)',
                                    '</sup>')))


The plot above demonstrates the negative relationship between Average Emissions and Average Percent Renewable. The distribution illustrates a negative linear relationship, where the points are relatively close to the plotted regression line with little deviation and little noise. There are also little to no unusual observations where points are following this negative trend. This graph is consistent with the hypothesis that as the Average Percent Renewable increases the Average Emissions decreases at a significant rate.

Code
co2_by_year_graph <- master |>
  ggplot(aes(x = Year, y = `Average Emissions`)) +
  geom_point(color = 'firebrick') +
  geom_line(color = 'firebrick') +
  labs( y = '', 
        title = 'Average Emissions Over Time') 
energy_by_year_graph <- master |>
  ggplot(aes(x = Year, y = `Average Percent Renewable`)) +
  geom_point(color = 'green4') +
  geom_line(color = 'green4') +
  labs( y = '', 
        title = 'Average Emissions Over Time') 
ggplotly(co2_by_year_graph) |>
  layout(title = list(text = paste0('Average Emissions Over Time',
                                    '<br>',
                                    '<sup>',
                                     'Average Emissions (1000 tons)',
                                    '</sup>')))
Code
ggplotly(energy_by_year_graph) |>
  layout(title = list(text = paste0('Average Percent Renewable Over Time',
                                    '<br>',
                                    '<sup>',
                                     'Average Renewable Energy Consumption (%)',
                                    '</sup>')))


As shown by the two distributions of Average Emissions and Average Percent Renewable over time, Average Percent Renewable is decreasing over the years while Average Emissions is increasing which illustrates a negative relationship, but perhaps not as expected. As time passes it makes sense as to why Average Emissions is increasing, because of extreme population growth and growing demand for energy production but Average Percent Renewable has shockingly been declining in the recent years. While this likely has something to due to the varying definition of renewable energy, for example whether or not nuclear energy is truly renewable, it is surprising that as technology develops renewable energy use does not. This signifies that humanity needs to increase the renewable energy production and usage on average in an effort to reduce the carbon footprint.

Model Fit:

Code
energy_emissions_model |>
  augment() |>
  summarize(`Variance of Fitted` = var(.fitted),
            `Variance of Residuals` = var(.resid),
            `Variance of Average CO2 Emissions` = var(`Average Emissions`)) |>
  kable(caption = 'Model Fit',
        digits = 10,
        format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
Model Fit
Variance of Fitted Variance of Residuals Variance of Average CO2 Emissions
415,649,404 46,532,518 462,181,923

The proportion of variability in the response values that was accounted for by the model, \(R^{2}\), was particularly large, at about 89.93 percent. This suggests a high quality model where a about 90% of the variation in the response, Average Emissions is explained by the explanatory variable, Average Percent Renewable in the model. This suggests there are potentially not many other extraneous factors influencing CO2 emissions.

Code
model_stats <- energy_emissions_model |>
  augment() |>
  rename(Residual = .resid,
         Fitted = .fitted) 


residual_plot <- model_stats |>
  ggplot(aes(x = Fitted , y = Residual, color = Residual  >0)) +
  geom_point(show.legend = FALSE) +
  scale_colour_manual(values = 
                        setNames(c('darkslategray4','darkolivegreen4'),
                                 c(T, F))) +
  geom_hline(yintercept=0, 
             linetype='dotted', 
             color = 'deeppink3', show.legend = FALSE) +
  labs(y = '',
       x = 'Fitted Values',
       title = 'Relationship between Residual and Fitted Values') +
  theme(legend.position="none")


ggplotly(residual_plot) |>
  layout(title = list(text = paste0('Relationship between Residual and Fitted Values',
                                    '<br>',
                                    '<sup>',
                                     'Residuals',
                                    '</sup>')))


We can see from the relationship between residuals and fitted values that the linearity is violated in the range of \(1.0\times10^5\) to \(1.4\times10^5\) . This is because in the range of \(1.0\times10^5\) to \(1.2\times10^5\), the observed data is over the estimated linear regression line, which render only positive residues. And in the range of \(1.2\times10^5\) to \(1.4\times10^5\), the observed data is below the linear regression line, which render only negative residuals. In the range of \(1.4\times10^5\) to \(1.6\times10^5\), the linearity is not violated, which means this range of data might fit the linear regression model better. This suggests that the assumption of equal variance of residuals in the model may be violated where the non-equal variance across the range of x-values will have the potential to seriously mis-estimate the variability of the slope. As such, the model and its results must be used cautiously.


3.1 Simulation

Simulation is a critical technology used to develop planning and explore models to optimize decision making within the world of data science. In this section, a basic linear model simulation is used to determine how well the model fits with the presetting conditions, such as adding normally distributed error to the regression line.

The basic procedure for simulation is to:

  1. Train a linear regression fit model for the observed data.
  2. Assuming the model is correct. Add a generated, or simulated, error to the linear regression model, in this case a normally distributed error.
  3. Establish the simulated data and compare values against the observed data by generating value distribution graphs, scatter plots of the relationships modeled and observed.
  4. Analyze and interpret the simulated \(R^2\) value.
  5. Iterate and generate additional simulated data sets.
  6. Check, interpret, and plot the simulated \(R^2\) values for the simulated data sets and compared to that of the observed.

Simulation for a single data set:

Code
noise <- function(x, mean = 0, sd){
  x + rnorm(length(x), 
            mean, 
            sd)
}
master_predict <- predict(energy_emissions_model)
master_sigma <- sigma(energy_emissions_model)

sim_response <- tibble(sim_emissions = noise(master_predict, 
                                          sd = master_sigma))

obs_emissions <- master |>
  ggplot(aes(x = `Average Emissions`)) +
  geom_histogram(binwidth = 3000, fill = 'lightsteelblue4') +
  labs(x = "Observed Average Emissions/1000 tons",
       y = "",
       subtitle = "Count",
       title = "Worldwide Average Emission Counts") +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0, face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))

sim_emissions_graph <- sim_response |>
  ggplot(aes(x = sim_emissions)) +
  geom_histogram(binwidth = 3500, fill = 'lightsteelblue4') +
  labs(x = "Simulated Average Emissions/1000 tons",
       y = "",
       subtitle = "Count",
       title = "Simulated Worldwide Average Emission Counts") +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0, face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))

obs_emissions + sim_emissions_graph

We can identify the differences between observed and simulated data from the visualization above looking at the distribution of CO2 emission data. When initially viewing the two graphs, they do not seem to share many similarities however, when we take a closer look, the concentrated value ranges, where most of the data resides, are similar.

The left graph displays the distribution of observed average emissions over the years from 29 different countries. We can see that most of the values fall within the intervals of \(1.1\times10^5\) tons to \(1.2\times10^5\) tons and \(1.6\times10^5\) tons to \(1.7\times10^5\) tons. The simulated yearly average emissions are also concentrated in the same ranges of \(1.1\times10^5\) tons to \(1.2\times10^5\) tons and \(1.5\times10^5\) tons to \(1.7\times10^5\) tons.

Code
sim_data <- master |> 
  filter(!is.na(`Average Emissions`), 
         !is.na(`Average Percent Renewable`)
         ) |> 
  select(`Average Emissions`, `Average Percent Renewable`) |> 
  bind_cols(sim_response)


raw_graph <- master |> 
ggplot(aes(x = `Average Percent Renewable`, 
           y = `Average Emissions`)
       ) +
  geom_point(color = 'lightsalmon3') + 

  theme(legend.position = "none") +
  labs(x = " Avg Renewable Energy Consumption (%)", 
       y = "", 
       title = "Observed Relationships between Avg Renewable\nEnergy Consumption and Avg CO2 Emissions", 
       subtitle = "Avg CO2 Emissions (1000 tons)") +
  theme(plot.title.position = "plot",
        plot.title = element_text(face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))

sim_master_graph <- sim_data |> 
ggplot(aes(x = `Average Percent Renewable`, 
           y = sim_emissions)
       ) +
  geom_point(color = 'lightsalmon3') + 
  theme(legend.position = "none") +
  labs(x = " Avg Renewable Energy Consumption (%)", 
       y = "", 
       title = "Simulated Relationships between Avg Renewable\nEnergy Consumption and Avg CO2 Emissions", 
       subtitle = "Simulated Avg CO2 Emissions (1000 tons)") +
    theme(plot.title.position = "plot",
        plot.title = element_text(face = 'bold', size = 8.5),
        plot.subtitle = element_text(hjust = 0, size = 7))


raw_graph + sim_master_graph

In these two graphs the relationships between the simulated and observed average renewable energy consumption and average CO2 emissions was plotted. The simulated data on the right graph is more closely related to a straight, modeled linear regression line.

Code
# Check the similarity between the simulated data and observed data
sim_data |> 
  ggplot(aes(x = sim_emissions, 
             y = `Average Emissions`)
         ) + 
  geom_point(color = 'darkolivegreen') + 
   labs(x = "Simulated Avg CO2 Emissions (1000 tons)", 
        y = "",
        title = "The Similarity between Observed and Simulated Data",
        subtitle = "Observed Avg CO2 Emissions (1000 tons)" ) + 
  geom_abline(slope = 1,
              intercept = 0, 
              color = "steelblue",
              linetype = "dashed",
              lwd = 1.5) +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))

In this graph, we can examine the similarity between the observed and the simulated data set. Ideally, the data should fall on the blue dashed line, which represents the y = x line. The points above the blue line represent an overestimated simulation, while the points below the line indicate an underestimated simulation. Following an analysis of this study, we can determine that the observed data is typically underestimated. Generally speaking, the simulated values are distributed close to the y = x line, meaning that there is a strong relationship between the observed values and simulated values.

Code
# get the R squared for the simulated data fit for observed data.
p_value <- lm(`Average Emissions` ~ sim_emissions, 
   data = sim_data
   ) |> 
  glance() |>
  select(p.value) |>
  pull()

sim_r2 <- lm(`Average Emissions` ~ sim_emissions, 
             data = sim_data
             ) |> 
  glance() |> 
  select(r.squared) |> 
  pull()

simulated_stats <- tibble(p_value, sim_r2) |>
  rename(`P - Value` = p_value,
         `Simulated R Squared` = sim_r2)


simulated_stats |>
  kable(caption = 'Simulated R Squared and P-Value', digits = 10) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
Simulated R Squared and P-Value
P - Value Simulated R Squared
0 0.8356973

The simulated p value is 0, and it is extremely small (p<0.05), which indicates the data is statistically significant.

The simulated \(R^2\) is 0.8356973, which means the simulated data can explain around 0.8356973 of the variation in observed data. In this case, the one time simulation till now did a moderately good job.

Simulation for 1000 data sets:

Since the normally distributed error is randomly added to the linear regression model line, if the simulation is iterated many times, the result will be far more stable. With this method, the entire group of interest will be obtained, rather than a single sample of that group.

In the following steps, 1000 simulated data sets are created and combined with the observed data set into a full new data set. Next, the \(R^2\) of the 1000 simulated data sets is determined and the distribution is plotted.

Code
# Created 1000 simulated dataset
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(master_predict, 
                                          sd = master_sigma)
                              )
                )

# clean the colnames
colnames(sims) <- colnames(sims) |> 
  str_replace(pattern = "\\.\\.\\.",
                  replace = "_")

sims <- master |> 
  filter(!is.na(`Average Emissions`), 
         !is.na(`Average Percent Renewable`)) |> 
  select(`Average Emissions`) |> 
  bind_cols(sims)


# mapping to get 1000 simulated R squared.
sim_r_sq <- sims |> 
  map(~ lm(`Average Emissions` ~ .x, data = sims)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)

# to see the distribution of the 1000 simulated R square.
tibble(sims = sim_r_sq) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.025, fill = 'lightpink3') +
  labs(x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models",
       title = "1000 Simulated R-Squared Distribution")+
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))

Code
#The distribution of these values will tell if our assumed model does a good job of producing data similar to what was observed. If the model produces data similar to what was observed, we would expect values near 1.
# Our model is R square is around 0.8.

We can see from the distribution of the 1000 simulated \(R^2\) graph that the \(R^2\) have the highest frequency around 0.8 to 0.85, which is in accordance with the single simulated data \(R^2\).

4 Conclusion



While the Earth may seem prosperous to us today, there is only a finite amount of resources, and life, on the planet and the human population just hit 8 billion. Global temperatures continue to increase, threatening life and habitat for many living organisms. Protecting humanity’s environment should be society’s number one priority and thought of with every decision made. Energy use, as well as CO2 emissions is also on the rise while the amount of renewable energy the world is using is decreasing.

Through analyzing this data set it is clear there is a strong, statistically significant effect of using renewable energy tp decrease the amount of CO2 emitted into the atmosphere yet humanity is using less renewables, moving in the wrong direction.

We urge every country to prioritize and help solve this global crisis putting the Earth at risk and use invest more heavily in renewable energy in an attempt to be sustainable.

5 References

  1. de Paula Ferreira, W., Armellini, F., & De Santa-Eulalia, L. A. (2020). Simulation in industry 4.0: A state-of-the-art review. Computers & Industrial Engineering, 149, 106868. https://doi.org/10.1016/J.CIE.2020.106868

  2. Gapminder. (2019). Data. Gapminder.org; Gapminder. https://www.gapminder.org/data/

  3. Scatter and Line Plots in R. (n.d.). Plotly.com. Retrieved March 17, 2023, from https://plotly.com/r/line-and-scatter/#custom-color-scales

  4. Schork, J. (n.d.). How to Use ggplotly in R (2 Examples) | Static to Interactive Plot. Statistics Globe. Retrieved March 17, 2023, from https://statisticsglobe.com/ggplotly-function-r#:~:text=In%20this%20one%2C%20you%E2%80%99ll%20learn%20how%20to%20use

  5. Subtitles with ggplotly. (2019, May 16). https://datascott.com/blog/subtitles-with-ggplotly/ test